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Abstract 


This paper presents the method of potential func- 
tions using B-splines as potential functions in 
the estimation of likelihood functions (probabil- 
ity density functions conditioned on pattern 
classes) or of the resulting discriminant func- 
tions. Integrated mean square consistency of 
this technique is discussed. Experimental 
results of using the likelihood functions thus 
obtained in the classification of remotely 
sensed data are given. 

*** 

The method of "potential functions" (also called 
"kernel functions") for the direct construction 
of likelihood functions and discriminant func- 
tions has been widely discussed in the literature 
on statistics and pattern classification (see 
for example [4] and the references therein) . 

In what follows, first we review very briefly 
this method. Second, we present it3 integrated- 
mean-square (IMS) consistency and give a formula 
for the value of the mesh parameter h(N) (to be 
defined in section 2) which is optimal with re- 
spect to IMS convergence. Next, we discuss the 
use of multivariate B-splines as potential func- 
tions, bringing into the discussion the IMS 
consistency criteria mentioned above. Finally, 
we present some of the experimental results ob- 
tained when likelihood functions constructed 
by means of B-spline potential functions were 
used to classify remotely sensed data pertaining 
to the Purdue LARS flight line Cl. 

1. Likelihood Functions and Discriminant Func - 
tions in Pattern Classification 


As a preamble to our results, let us briefly 
recall the Bayesian solution to the pattern 
classification problem. 


Pjf x (x/H^)-P i f x (x/H i )> 0, ijtj, i=l, . . . ,n (1) 
The left side of (1) 

gj i <x)= P j f X (x/Hi)-P i f x (x/Hi) (2) 

is called a discriminant function. Since (1) 
is equivalent to 

gji(x) =log(f x (x/Hj)/f x (x/H i )) + log(P j /P i ) > 0, 


gj^(x) is also sometimes called a discriminant 
function. 


For j = 1, ..., M, let there be given the n- 
vectors '(j)_ (j) (jh (j) 

,y ln 


11 


^ J>=col <\P 

j J 


(j) 


Nj n 


) constituting 


the training set Tj (N<) belonging to the pattern 
class hJ. The problems to which we will be 
addressing are: 


(a) Given Tj(N-j) construct an estimate 

fjKx/Hj^Nj)) of f X (x/HJ); 

(b) Given Tj(Nj) and T i (Ni) construct an 
estimate 


gj^xiT^Nj),^^)) of 8ji (x). 

For simplicity in notation, from now on we will 
drop the superscript and subscript j whenever it 
is clear that we are referring to the estimation 
of a likelihood function pertain . ,ig A to a given 
class Hi, and rewrite f x (x/HJ) and f x (x/HJ ,Tj (Nj)) 
Simply as f X (x) and ? x (x/T(N)) > respectively. 

2. The Method of Potential Functions 


Suppose that observations made on patterns, 
which are to be classified as pertaining to one 
of the pattern classes H*, H M , appear as 

n-vectors belonging to the real Euclid/an space 
R n . Then any given observation x = col (x-^, ..., 

Xjj) may be viewed as a realization of a random 
vectdr X = col (X]_, X n ) . Associated with 

each pattern class hJ, j = 1, ..., M, there is 
the conditional probability density function** 
f x (x/HJ) , called the likelihood function for 
the class H-*, and the prior probability H-^ for 
that class. The Bayes decision rule, which 
minimizes the probability of misclassif ication, 
consists of classifying any observed x as arising 
from hJ if 

* Supported by 

+ Supported by the NSF Grant GK-36375. 

+ Supported by ONR Grant NR-042-283. 

** Even though f(x) is the correct notation for the 
use the same notation f(x) for the "function" and 
the context. 


We will now indicate how the method of potential 
functions is used in the solution of problems 
(a) and (b) above. 

V 

According to this method, in the solution of prob- 
lem (a), the estimate of f x (x/T(N)) is constructed 
in the form „ 7 

- _i ^ 

f y (x/T(N)) = N Z cp(x,y ), (3) 

X k=l K 

where 9(x,z) called a "potential function" or 
"kernel function" is a real-valued function of 
the n-vectors x and z, satisfying appropriate 
conditions. 


value" at x of a function f or f(.), we often 
"function values" when the meaning is clear from 
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'For the one-dimensional case, i.e. for x and y^ 

* n R , Parzen [7] was one of the first investiga- 
tors to suggest the construction of a probability 
density function using (3) and for this reason 

(3) is often called a Parzen estimator of the 
probability density function 

Parzen suggested specifically potential functions 
of the form 

cp(x,z) = h" 1 (N)K(h“ 1 (N)(x-z)), (4) 

where h(N), is a "mesh parameter", dependent on 
N, sufficiently small so as to validate the 
assumption of f x (x) being nearly constant on 
any interval (z-h(N) , zfh(N)). Parzen [7] gave 
conditions on K and h(N), which guarantee mean 
square consistency of (3) for a wide class of 
densities. He also gave a formula for optimal 
h(N), i.e. the values of h(N) , n = 1, 2, ...» 
which maximize the rate of convergence of an 
approximation to the mean square error to zero. 

Parzen 1 s results were extended to the n-dimen- 
sional case by Murthy [6] and Cacoullos [1] . 

Note that in this case, we have in general n 
scalar mesh parameters h^(N), ..., h n (N) and 

(4) is replaced by 

cp(x,z)=(h 1 (N)h 2 (N)...h n (N))‘ 1 K(H' 1 (N)(x-z)) t (5) 
where 


H(N) = Diag(h 1 (N), ..., h n (N)) (6) 

However, by suitable normalization, we may make 
all h t (N), i = 1, . .., n the same, i. e. 

h^N) = h 2 (N) = ... = h n (N) « h(N), (7) 

which when substituted in (5) leads to 

«p(jc,z)»h' n (N)K(h' 1 (N)(x- Z )). (8) 

So from now on, without loss in generality we 
will assume equation (7) and equation (8) hold. 

Referring next to Problem (b) , it is clear that 
the above technique can be used to estimate gj^(x) 
directly from the training sets Tj (Nj) and T^INi) . 
In fact, let the elements of T j (Nj)Ux^(N i ) , 
ordered in any arbitrary way. Be labeled z^, 
k = 1, ...» NjfN^, and define the function 



where, for l = j,i 


(9) 






0, otherwise. 


Then substituting (3) in (2) and using (9), we 
obtain 




N.+N. 

- J- J- 


^(x,z k ) (ly 


It clearly follows from this definition that our 
consistency results developed for (3) apply also 


to (11). 


3. Integrated Mean Square Consistency 
mality 


and Opti- 


The above-mentioned consistency and optimality 
results are with respect to the mean square 
error 


E((f x (x/T(N» - f x (x)) 2 } 


( 12 ) 


for a given x. However, in approximating f x 
one is most likely to have some a priori know- 
ledge of the "global behavior", like "global 
smoothness", of the function to be approximated 
rather than its "local behavior". In such cir- 
cumstances, the integrated mean square (IMS) 
error criterion in the coice of optimal h(N) be- 
comes more meaningful as we shall explain in 
section 4. The IMS error is defined by 

V(T(N)) = jE((f x (x/T(N)) -f(x)) 2 }dx, (13) 

where the integration is performed over R n . 

The following result is proved in the Appendix. 

Theorem 1. Suppose 

(I) The random samples y 1# . . . , y^ are 
independently and identically distributed as X 
whose density is f x ; 

(II) f x € L 2 (R n ); 

(III) K:R n - > R + (where R**” = set of nonnegative 
real numbers) is such that 

(III-l) K€L 2 (R n ), 

(III-2) J* K(x)dx = 1, 

R 

(III-3) ess sup K(x) < ”, 
x£ R n 

(III-4) lim |lx||K(x) = 0, 

where ||x|| = (x^ 2 , + ... + x^)^, and 

(IV) h(N) is such that 
(IV-1) lim h(N) = 0 

and 

(IV -2) lim Nh n (N) = « . 

A 

Then f x (x/T(N)) is a consistent estimator of 
f x (x) in the IMS error sense, that is 

lim V (T(N)) = 0 . 

N"*» 

We next seek a formula for the value h*(N) 
of h(N), whicl^ optimizes the IMS rate of con- 
vergence, of f x to f x . This is obtained by 
modifying Cacoullos' [1] result for the mean 
square convergence case as follows: 
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Theorem 2 . Let the hypotheses of Theorem 1 hold 
and assume, in addition, that K is symmetric 
(i.e. K(-x) = K(x) ) and f x is thrice differen- 
tiable and such that the second partial derivative 
of f x are in L2(R n ). Then within o(h 4 (N)) the 
IMS error (defined by (13)) is minimized by 
choosing 

.2 


h*(N) = N" 1/( ^ 


nllKli: 


n+4 


n 

£ p, 


a 2 f. 


(14) 


IZl 3-l^i ; V X ) .2 1 


where, for a function g, 

ilgll k =(J n lsCx) | k dx) 1/k 

R 

and 


“lj = In x i X J K(lt)dX 

K 


(15) 


(16) 


The optimal rate of convergence, corresponding 
to the choice (14), is 


V* (T(N) ) = (4" 1 n+l)n 0+4 N 0+4 


n n 

£ Z , 




8 

n+4 


A ^ 


+ o(h* 4 (N)). 


(17) 


If, as we shall assume in the following section, 
the kernel K has the product form 

K(x) « n K 0 (x ) , (18) 

i=l U 1 

where K n is an even one -dimensional kernel, then 
Theorem 2 further simplifies to: 

Theorem 3: Under the condition on K just stated, 
the results (14) and (17) of the preceding theorem 
assume the forms (19) and (20) below: 


h*(N) = N 


1 

n+4 


*01^ 


4„ n 



1 

n+4 


(19) 


i = l 


V*(T(N) - (4'Vl)n N ^ A^f^AjCKg), (20) 


where 


4 - L x t K o (x i )dx i 


A i^V _ I * >2 

1 x 1 t=l 5x. 


and 


yv 



(21) 

( 22 ) 

(23) 


We will call A 2 (K 0 ) the optimal, kernel-depende nt 

convergence since it represents the part 
of the right side“f (20) which depends on K Q . 

We omit proofs of theorems 2 and 3 above since 
they represent straightforward extensions of those 

in [11. 


4. B-Spllnea as Potential Functions 

It is clear from the formulas in (19) and (20) 
that the choice of the optimal h* (N) , and hence 
V*(T(N)), depends on the properties of f x and 
on the structure of the product kernel K. 

The L 2 norm of the second derivative of a func- 
tion represents a measure of the ’’global smooth- 
ness” of the function. In this sense, according 
to (22), A.(f x ) is a monotonically increasing 
function or tne "global smoothness" of Thus 

an a priori knowledge of the smoothness can be 
incorporated in the formulas (19), (22), and (20) 
by assigning a value to 


n 

S 

i-1 


a 2f x 


2n 

n+4 


in those formulas. 

In picking K one would like to choose its struc- 
ture so that the optimal kernel-dependent rate of 
convergence ^(Kg^ is minimized. Then the 


choice of the support of 


K o 


represents a compro- 


mise between the minimization of its second moment 
and its \r norm. Kq must also be at least as 
smooth as f^, particularly if only a few train- 
ing samples are available. Based on these con- 
siderations, we suggest for the structure of K_ 
a univariate B-spline, and hence for K a product 
of n such splines. Such a choice for Kq 
constitutes a compromise between a Gaussian 
kernel and a square kernel 


w = \ 


l x i 1 < 1 

otherwise, 


(24) 


We note that a multivariate polynomial, as sug- 
gested by Specht [10], while certainly adequate 
for approximating a large class of discriminant 
functions given in the form (la), is certainly 
unsuitable for the representation of K over 
the entire R n since it violates the conditions 
(III) of Theorem 1. 

Let any given component variable x. of x be 
denoted by | . Then a univariate 1 B-spline of. 
degree m-1 with support on (?g,^) and knots 


*0< h< —<■'* 


is defined by [2,3,9]: 


M m C5) - £ W5 », 

i=0 L 


where 


and 


«>(o = 

Uoo, if g(?) > o. 


s + «> = { 


1° 


if g(S) < 0 . 


(25) 


(26) 


(27) 


(28) 


If we : (1) assume that the degree of the B-spline 
is odd, i.e., r = m-1 = 2k-l, k a positive in- 
teger; (2) center the 3pline about the origin as 
required by Theorems 2 and 3; and (3) let the knots 
of the spline occur at integer values; then we may 
obtain the B-spline representation for K rt (see 
[9]). 0 

K 0 <5> - M r+1 ( ? ) = £ 1 (-l)** V t . 0 r {2 
^=-k U+k } + 



( 35 ) 


where, -as indicated above, m= r+l = 2k , 


Substituting (29) in (21), (23) and then (19) and 
(20), we obtain the formula for optimal h*(N) 
and the optimal kernel-dependent rate of conver- 
gence AjOCq): 

i 


h* (N) = N 


1 

it+4 


144n(M (0)) 11 

r?f~2 


i=l 


Sx' 


m4- 


and 


where 


W ^12^ 


i 2 


2n/(h+4) 4n/(h+4) 


(30) 


(31) 


¥ = r r s 1 (-l)j +r+1 / 2(r+1) '\l 2r+1 

V T2m)T » ( y+r+1) ) J 


(32) 


Numerical values for /L(Kq) given by (31), for 
r = 1,3, and 5, are listed in Table I. 


TABLE I 


r 

V K o> 

i 

.353075 

3 

.357836 

5 

.359683 


Let ??(m,,£;x) denote the value at x of the 
normal density with mean p, and covariance matrix 
£ . For Kq(x) = 7?(0,cr;x) we have 

!l*?? (2) (0,a 2 ; -)\\\ = (3/8lT*V 5 ). (33) 


Using (33), with a = 1, in (30), we get the 
formulas for h*(N), for any dimension n and 
r = 1,3, presented in Table II. 


TABLE II 


r 

h*(N) 

! i 

0 

36n(. 66666) n ] ltfn+4) 

.2115 J 

3 


f 9n(,49365) n } 1 /^+ 4 ) 

L .2115 J 


5 . Computer Simulation Results 

In this section we present some of the computer 
simulation results performed on tha Rice Univer- 
sity IBM 370/155 digital computer for the purpose 
of testing how well the B-spline potential function 
algorithm performs in the construction of likeli- 
hood functions. 

Given a set of samples T(N) = {y 1 ,...,y n ), where 
each y^ is an independent realization of a 
random variable X with density f^, let the ^ 
sample mean \x and sample covariance matrix £ 
be defined in the usual way, i.e. 

1 N 

(34) 



i=i 


where the superscript T denotes the transpose. 
Then ^(p.Zjx) will he called the "sample 
normal density". 

The simulation results mentioned above are shown 
in Figs. 1 through 6. 


Fig. 1 is a graphic display of the bimodal 
density 

f x (x) = p9?(p. 1 ,E;x) + (l-p)7?(^ 2 ,E;x) (36) 


with the mixing parameter p - 
col.(-2,0), \i.. = col(6,0) and 
^5.75 4.34‘ 


£ = 


■4.34 6.64 


and = 


(37) 


Figs. 2 and 3 are displays of f (x/T(N)) 
obtained by the B-spline potential function 
algorithm corresponding respectively to N=50 
and N = 300 samples from the above density. 
Fig. 4 shows the sample normal density approxi- 
mation of the same density on the basis of 50 
samples . 


To show the effect of the increase in dimension- 
ality on the performance of our algorithm, we 
present in Figs. 5 and 6 the cross-sections 
through the x. -axis of the density estimates, 
obtained by the B-spline potential function 
algorithm, of the bimodal density (36) on four- 
and six-dimensional spaces under the conditions 
given in those figures. 

In all this work we used cubic B-splines with 
the mesh parameter h*(N) equal to the second 
entry in Table II. 


From the above few results we conclude that the 
B-spline potential function algorithm appears 
to fare well in the construction of likelihood 
functions from only a modest num r of samples 
and with densities that are not necessarily 'uni- 
modal and on spaces that are not necessarily of 
too low a dimension. 


6 . Application to Remote Sensing 

To test the effectiveness of the B-spline 
potential function algorithm for classification, 
discriminant functions were obtained from the 
likelihood functions generated by the algorithm, 
for Bayesian classification of agricultural 
crops. The algorithm was based on cubic B-splines 
chosen as in section 4 of this paper, and was 
implemented in the LARSYSAA VERSION 2 ; 1 L.;h was 
developed at the Laboratory for Applications of 
Remote Sensing, Purdue University, Lafayette, 
Indiana. We also performed classification using 
sample normal densities as likelihood functions. 
(See the beginning of section 5 for the defini- 
tion of "sample normal density.") 

The data used in our experiments pertained to 
the Purdue LARS flight line Cl, which has been 
widely employed for testing algorithms on 
remote sensing. This data consists of the output 
of a twelve channel spectrometer which analyzes 
the reflected radiance from the object being 
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sensed. Let a given crop field belonging, say, 
to the pattern class Hi, be discretized (parti- 
tioned) into N, points, called "resolution 
elements". The 1 ^ spectrometer maps each resolu- 
tion element into, a 12 -tuple of real numbers, 

say y U = col(y£| y k 12^’ and the ** ole 

field provides 1C such 12 -dimensional vector 
samples yP ; , . . .,y:i d } , which can be used to train 
a classifier in the- 1 acquisition of the likelihood 
function corresponding to H- 1 . 


is also a pleasure to acknowledge a helpful 
comment from Professor Polking on the proof of 
Theorem 1 and the encouragement received in 
carrying out this research from Dr. M. S. Lynn, 
Director of the Rice University Institute for 
Computer Services and Applications. 

Appendix 

Proof of Theorem 1 


Our first example is designed to show the poor 
results obtained when normality is assumed on 
bimodal data. In this example, we used data from 
only one channel, namely Channel 1 (.40m to .44m), 
to classify data corresponding to the two bimodal 
pattern classes: H 1 : RED CLOVER HAY and C0RN1; 
and H z : BARE S0IL1 and ALFALFA1. Figs. 7 and 8 
show the histograms of the classes. The per- 
centages of the number of correct classifications, 
for a typical set of observations corresponding 
to H 1 and H z , are indicated in Table III, both 
for the algorithm presented here and for the 
sample normal classification algorithm. 


TABLE III 



Potential 

Sample 


Function 

Normal 


Algorithm 

Algorithm 

H 1 

86$ 

26$ 

H 2 

98$ 

.3$ 


It is clear that the much superior performance 
of the potential function algorithm in relation 
to the sample normal algorithm may be attributed 
to the bi -modality of the data. 

Our second example is for the purpose of testing 
the effectiveness of the potential function algo- 
rithm for normal data. In this example, we used 
3 channels, namely Channels 1 (.40m - .44m.), 10 
(,66 m “ »72m), 12 (.60m “ 1.00 m), to classify 

H A ; SOYBEANS, H 2 : CORN, H 3 : OATS, and H 4 : WHEAT. 
The percentages of correct classifications are 
displayed in Table IV, for H 1 and H 2 . 


For notational convenience let 

^(x) = h'^Ch^x). (A-l) 

Then, clearly* (see for example [5, p. 172)), 
E((f x (x/T(N)) - f x (x)) 2 } 

=Var(f x (x/T(N))) 4- B 2 {f x (x/T(N))), (A-2) 

where 

Var(f x (x/T(N)) = E(f 2 (x/T(N) )} 

- E 2 {f x (x/T(N))} (A-3) 

and A A 

B(f x (x/T(N))) = E(f x (x/T(N))} - f x (x).(A-4) 

Hence it follows that 

limllEt (f (x/T(N)) - f (x)) 2 }fc 
N-w® x XI 

= lim||Var(f (x/T (N)))|L 

1 

+ limllB(f (x/T(N)))|| 2 . (A-5) 

N-* A Z 

The proof will consist of showing that each term 
on the right side of (A-5) tends to zero. 

Since the random variables Y. , i = 1,...,N 
are each independently distributed as X, we have 
in accordance with (3 ), that 

E(f x (x/T(N))) = EfK^Cx - X)}, (A-6) 


TABLE IV 


drid a 7 i 

Var{f (x/T (N) )) » N _ Var{K h (x-X)) 

Potential 

Sample 

Function 

Normal 

Now 

Algorithm 

Algorithm 

K h (x)*f x (X) S 

H 1 97$ 

99$ 

R 

H 2 94$ 

99$ 

= E{K h (x-X)) 


(A-7) 


Even though the efficiency of classification by 
the potential function algorithm is lower than by 
the sample normal, we note that the ability of 
the potential function algorithm to classify 
effectively data that is normal is comparable 
with the sample normal in quality of classifica- 
tion. 

Acknowledgement 

We are much indebted to Dr. D. Van Rooy for im- 
plementing the classification algorithm into the 
LARSYSAA system and to Mr. Ken Baker of the 
Lyndon B, Johnson Space Center for his advice 
on the remote sensing problem investigated. It 


= E{f(x/T(N))}, (A-8 ) 

where, in going from the third to the last 
member, we have used (A-6). 

Simi larly , 

*In this Appendix, we use capital letters for 
symbols denoting random variables and corres- 
ponding small letters for realizations of these 
random variables. In (A-2), f (x/T(N)) is to 
be regarded as a function of tfie random variables 
Yj,...,Yjj the realizations of which are the 
training samples y^,...,y^. 
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